######################################################################
############################## Overview  ###############################
######################################################################

Replication files and code for:

Zhang Han and Jennifer Pan. 2019. “CASM: A Deep-Learning Approach for Identifying
Collective Action Events with Text and Image Data from Social Media.” Sociological
Methodology 49: 1-59.


######################################################################
######################## Core Models and Output ##########################
######################################################################

requirements.txt : required software versions and packages

modelfiles/ :

	This folder contains trained binary model files

	Model definition and weights of the first-stage CNN-RNN text model:
	|____text-stage1.json
	|____weights_text-stage1.h5

	Model definition and weights of the second-stage CNN-RNN text model:
	|____text-stage2.json
	|____weights_text-stage2.hdf5

	Model definition and weights of the image-based CNN model:
	|____image.json
	|____weights_image.h5
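
Each model above ships as a definition file plus a weights file. A minimal sketch of how such a pair could be loaded, assuming the .json files are Keras model definitions and that a Keras version matching requirements.txt is installed. The names MODEL_FILES, model_paths, and load_model are hypothetical helpers, not part of lib/; the scripts in lib/ remain the reference for how the models are actually called.

```python
from pathlib import Path

# (definition file, weights file) pairs, as listed in modelfiles/.
MODEL_FILES = {
    "text-stage1": ("text-stage1.json", "weights_text-stage1.h5"),
    "text-stage2": ("text-stage2.json", "weights_text-stage2.hdf5"),
    "image": ("image.json", "weights_image.h5"),
}


def model_paths(name, model_dir="modelfiles"):
    """Return (definition_path, weights_path) for one of the three models."""
    definition, weights = MODEL_FILES[name]
    base = Path(model_dir)
    return base / definition, base / weights


def load_model(name, model_dir="modelfiles"):
    """Rebuild a model from its JSON definition, then load its weights.

    Assumption: the JSON files were written by Keras' model.to_json().
    Keras is imported here so the path helper works without it installed.
    """
    from keras.models import model_from_json

    definition_path, weights_path = model_paths(name, model_dir)
    model = model_from_json(definition_path.read_text())
    model.load_weights(str(weights_path))
    return model
```

Calling load_model("image") from the repository root would then return the image classifier, provided the definition uses only layers that the installed Keras version knows about.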


output/ :
	|____protest_posts.csv:  predicted protest posts
- post_id: unique id for each post
- content: text of the post
- time: time of the post in year-month-day format, Beijing time
- imgstr: filenames of images associated with the post; blank if there are no images, multiple filenames separated by semicolons
- words: preprocessed text of post used as input for classifiers
- prov: province (sheng); based on latitude / longitude if available, text otherwise
- prov_code: provincial 2016 guobiao code
- city: prefecture (shi); based on latitude / longitude if available, text otherwise
- city_code: prefecture 2016 guobiao code
- county: county (xian); based on latitude / longitude if available, text otherwise
- county_code: county 2016 guobiao code
- prob_c1: predicted probability from first-stage CNN-RNN text classifier
- prob_img: predicted probability from CNN image classifier
- prob1_combined: combined predicted probability from first stage text and image classifiers
- prob_c2: predicted probability from second-stage CNN-RNN text classifier
- prob2_combined: combined predicted probability from second stage text and image classifiers
- date: date of post
- county_code_parsed: standardized county 2016 guobiao code
- county_code_dedup: deduplicated standardized county 2016 guobiao code
- event_id: unique event id associated with post that combines the county_code_parsed and date
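
The columns above can be read with Python's standard csv module. The sketch below splits imgstr into a list of filenames and groups post ids by event_id. It is illustrative only: the function names and the added "images" key are hypothetical, and the sample rows are made-up placeholders rather than real data.

```python
import csv
import io
from collections import defaultdict


def read_posts(handle):
    """Yield one dict per post, adding an 'images' list parsed from imgstr."""
    for row in csv.DictReader(handle):
        imgstr = row.get("imgstr") or ""
        # imgstr is blank when a post has no images.
        row["images"] = [name for name in imgstr.split(";") if name]
        yield row


def group_by_event(posts):
    """Map each event_id to the list of post_ids assigned to it."""
    events = defaultdict(list)
    for post in posts:
        events[post["event_id"]].append(post["post_id"])
    return dict(events)


# Placeholder rows with only the columns used here (not real data).
sample = (
    "post_id,imgstr,event_id\n"
    "p1,a.jpg;b.jpg,e1\n"
    "p2,,e1\n"
    "p3,c.jpg,e2\n"
)
posts = list(read_posts(io.StringIO(sample)))
events = group_by_event(posts)
```

For the real file, the same functions could be given a handle from open("output/protest_posts.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption.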

	|____protest_events.csv:  predicted protest events
- event_id: unique event id with combined county_code_parsed and date
- forms: form of the protest: “conventional”, “disruptive”, or “violent”
- issues: all issues associated with the event; multiple issues are separated by semicolons
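
Because issues may hold several semicolon-separated values, it has to be split before counting. A minimal sketch that tallies events by form and by issue; summarize_events is a hypothetical helper, and the issue labels in the sample are placeholders, not the categories used in the data.

```python
import csv
import io
from collections import Counter


def summarize_events(handle):
    """Return (form_counts, issue_counts) for an events file."""
    form_counts = Counter()
    issue_counts = Counter()
    for row in csv.DictReader(handle):
        form_counts[row["forms"]] += 1
        # An event with several issues is counted once under each issue.
        for issue in row["issues"].split(";"):
            issue = issue.strip()
            if issue:
                issue_counts[issue] += 1
    return form_counts, issue_counts


# Placeholder rows (not real data).
sample = (
    "event_id,forms,issues\n"
    "e1,conventional,issue_a;issue_b\n"
    "e2,violent,issue_a\n"
)
form_counts, issue_counts = summarize_events(io.StringIO(sample))
```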


plot/:
	Each subfolder contains Python and/or R scripts that generate the tables and figures included in the manuscript. Subfolders also include additional data files as needed.



######################################################################
############################# Dependency files ##########################
######################################################################
lib/ : 
	Supplementary Python scripts and packages

|____word_preprocessing.py: takes text input and produces segmented output
|____CASM_generate_predicted_probability_text.py: calls binary models to generate predicted probability based on text
|____CASM_generate_predicted_probability_image.py: calls binary models to generate predicted probability based on images
|____common_operations.py: other Python functions, e.g., to parse dates
|____LSTM_text_dependency.py: dependency packages for second-stage CNN-RNN model
|____dependency.py: dependency packages for the first-stage CNN-RNN
|____CASM_c1_deep_text.py: used by CASM_generate_predicted_probability_text.py
|____CASM_c2_deep_text.py: used by CASM_generate_predicted_probability_text.py

supporting/ : 

	This folder contains other data files that we use in classification and analysis
|____stopwords1.txt: stopwords we used for preprocessing the input data (found in daily_c1_output/)
|____jieba.dict.big.txt: Jieba dictionary
|____high_frequency_protest_words.txt: high-frequency (top 1000) words in Wickedonna; we add these words to the Jieba dictionary for segmentation
|____vocab_pos_KGP_50000.dict: dictionary used for the first-stage CNN-RNN text model
|____vocab_pos_grievance.dict: dictionary used for the second-stage CNN-RNN text model
|____propaganda_media_words.txt: dictionary used to identify government and party social media accounts
|____village_level_dict.txt: dictionary whose keys are the administrative region names from 2016 guobiao and values are the 2016 guobiao codes.
|____village_level_dict_reversed.txt: dictionary whose keys are the 2016 guobiao codes and values are the administrative region names from 2016 guobiao.